setwd('/Users/billyge/Desktop/Emory Fall 2022/QTM 385-4/problem sets/ps github/ps3')
#setwd('/Users/annie/Desktop/QTM385/QTM350_Github/QTM350 Github/QTM385/ps3')
rm(list = ls())

#libraries used throughout: dplyr (select), caret (dummyVars), glmnet,
#ggplot2/qplot, splines (ns), and Metrics (mse)
library(dplyr)
library(caret)
library(glmnet)
library(ggplot2)
library(splines)
library(Metrics)

c_test <- read.csv('college_test.csv')
c_train <- read.csv('college_train.csv')
o_test <- read.csv('office_test.csv')
o_train <- read.csv('office_train.csv')

This is the third problem set for QTM 385 - Intro to Statistical Learning. This homework will cover applied exercises related to predictor selection for linear regression models.

Please use the intro to RMarkdown posted in the Intro module and my .Rmd file as a guide for writing up your answers. You can use any language you want, but I think that a number of the computational problems are easier in R. Please post any questions about the content of this problem set or RMarkdown questions to the corresponding discussion board.

Your final deliverable should be two files: 1) a .Rmd/.ipynb file and 2) either a rendered HTML file or a PDF. Students can complete this assignment in groups of up to 3. Please identify your collaborators at the top of your document. All students should turn in a copy of the solutions, but your solutions can be identical to those of your collaborators.

This assignment is due by October 3rd, 2022 at 11:59 PM EST.

Collaborators: Billy Ge, Latifa Tan, Annie Luo


Problem 1: An Important Ridge and LASSO Identity (15 pts)

Both Ridge and LASSO present regularized solutions to the ill-posed variable selection problem for the linear model. Recall that we can view each of these methods as approaches to estimating coefficients that minimize different loss functions:

\[\hat{\boldsymbol \beta}_{OLS} = \underset{\boldsymbol \beta}{\text{argmin }} (\boldsymbol y - \mathbf X \boldsymbol \beta)'(\boldsymbol y - \mathbf X \boldsymbol \beta)\]

\[\hat{\boldsymbol \beta}_{Ridge} = \underset{\boldsymbol \beta}{\text{argmin }} (\boldsymbol y - \mathbf X \boldsymbol \beta)'(\boldsymbol y - \mathbf X \boldsymbol \beta) + \lambda \sum \limits_{j = 1}^P \beta_j^2\]

\[\hat{\boldsymbol \beta}_{LASSO} = \underset{\boldsymbol \beta}{\text{argmin }} (\boldsymbol y - \mathbf X \boldsymbol \beta)'(\boldsymbol y - \mathbf X \boldsymbol \beta) + \lambda \sum \limits_{j = 1}^P |\beta_j|\]

There is a relationship between the optimal solutions for the regression coefficients under no penalty (the OLS solution), the ridge penalty, and the LASSO penalty. The exact relationship cannot be derived for most cases, but we can gain some knowledge by assuming that the predictors are exactly orthogonal to one another. Since we can always rescale the variance of the features, we can further restrict this to feature sets that have orthonormal columns - \(\boldsymbol{X'X} = \mathcal{I}_P\).

Assuming the feature matrix, \(\boldsymbol{X}\), has orthonormal columns, show that:

\[\hat{\beta}_{OLS} = \boldsymbol{X'y} \text{ ; } \hat{\beta}_{Ridge} = \frac{\hat{\beta}_{OLS}}{1 + \lambda} \text{ ; } \hat{\beta}_{LASSO} = \text{sign}(\hat{\beta}_{OLS}) \times \text{max}(|\hat{\beta}_{OLS}| - \lambda , 0)\]

What do these equations show about how ridge and LASSO shrink coefficients towards zero? Why can the LASSO set a coefficient to zero while ridge cannot?

Notes:

  1. \(\sum \limits_{j = 1}^P \beta_j^2\) can also be expressed as \(\beta'\beta\).

  2. Substitute \(\hat{\beta}_{OLS} = \boldsymbol{X'y}\) whenever possible.

  3. It is probably easier to drop down to elements of the coefficient vector for the LASSO at a certain point - e.g. work with \(\beta_j\) instead of \(\boldsymbol{\beta}\). Since everything becomes a sum of squares or a sum of absolute values, you can separate the problem coordinate by coordinate this way.

  4. You can take it as a given that the sign of \(\hat{\beta}_{OLS,j}\) is the same as \(\hat{\beta}_{LASSO,j}\) (unless the LASSO solution is zero, but it won’t matter there).

  5. These solutions can be found online and in various textbook resources. I think this is a great exercise for thinking through complex minimization problems, so don’t spoil yourself unless you really find yourself stuck.

Problem 1 solution:

OLS:

\[ \begin{eqnarray} \frac{\partial \left[ \left(\boldsymbol{y} - \boldsymbol{X}\boldsymbol{\beta} \right)' \left(\boldsymbol{y} - \boldsymbol{X}\boldsymbol{\beta} \right) \right]}{\partial \boldsymbol{\beta}} &=& 0\\ -2 \boldsymbol{X}' \left(\boldsymbol{y} - \boldsymbol{X}\boldsymbol{\beta} \right) &=& 0\\ \boldsymbol{X}'\boldsymbol{y} - \boldsymbol{X}'\boldsymbol{X}\boldsymbol{\beta} &=& 0\\ \hat{\boldsymbol{\beta}}_{OLS} &=& \left(\boldsymbol{X}'\boldsymbol{X} \right)^{-1}\boldsymbol{X}'\boldsymbol{y} \end{eqnarray} \]

Since we are assuming orthonormal columns, we plug in \(\boldsymbol{X'X} = \mathcal{I}_P\):

\[ \begin{eqnarray} \hat{\boldsymbol{\beta}}_{OLS} &=& \mathcal{I}_P^{-1}\boldsymbol{X}'\boldsymbol{y} = \boldsymbol{X}'\boldsymbol{y} \end{eqnarray} \]

Ridge:

\[ \begin{eqnarray} \frac{\partial \left[ \left(\boldsymbol{y} - \boldsymbol{X}\boldsymbol{\beta} \right)' \left(\boldsymbol{y} - \boldsymbol{X}\boldsymbol{\beta} \right) + \lambda \boldsymbol{\beta}'\boldsymbol{\beta} \right]}{\partial \boldsymbol{\beta}} &=& 0\\ -2 \boldsymbol{X}' \left(\boldsymbol{y} - \boldsymbol{X}\boldsymbol{\beta} \right) + 2\lambda\boldsymbol{\beta} &=& 0\\ \lambda\boldsymbol{\beta} &=& \boldsymbol{X}'\boldsymbol{y} - \boldsymbol{X}'\boldsymbol{X}\boldsymbol{\beta} \end{eqnarray} \]

Plugging in \(\boldsymbol{X'X} = \mathcal{I}_P\) and \(\hat{\boldsymbol{\beta}}_{OLS} = \boldsymbol{X}'\boldsymbol{y}\), we get

\[ \begin{eqnarray} \lambda\boldsymbol{\beta} &=& \hat{\boldsymbol{\beta}}_{OLS} - \boldsymbol{\beta}\\ (1+\lambda)\boldsymbol{\beta} &=& \hat{\boldsymbol{\beta}}_{OLS}\\ \hat{\boldsymbol{\beta}}_{Ridge} &=& \frac{\hat{\boldsymbol{\beta}}_{OLS}}{1+\lambda} \end{eqnarray} \]

LASSO: With orthonormal columns the objective separates across coordinates, since

\[ (\boldsymbol y - \boldsymbol X \boldsymbol \beta)'(\boldsymbol y - \boldsymbol X \boldsymbol \beta) + \lambda \sum \limits_{j=1}^P |\beta_j| = \boldsymbol y'\boldsymbol y + \sum \limits_{j=1}^P \left( \beta_j^2 - 2\beta_j \hat{\beta}_{OLS,j} + \lambda |\beta_j| \right) \]

using \(\boldsymbol\beta'\boldsymbol X'\boldsymbol y = \boldsymbol\beta'\hat{\boldsymbol\beta}_{OLS}\) and \(\boldsymbol\beta'\boldsymbol X'\boldsymbol X\boldsymbol\beta = \boldsymbol\beta'\boldsymbol\beta\). So we can minimize over each \(\beta_j\) separately. For \(\beta_j \neq 0\), the first-order condition is

\[ \begin{eqnarray} 2\beta_j - 2\hat{\beta}_{OLS,j} + \lambda \, \text{sign}(\beta_j) &=& 0\\ \beta_j &=& \hat{\beta}_{OLS,j} - \frac{\lambda}{2} \, \text{sign}(\beta_j) \end{eqnarray} \]

Absorbing the factor of 2 into the penalty constant (equivalently, rescaling \(\frac{\lambda}{2} \mapsto \lambda\); the exact threshold depends on whether the loss carries \(\frac{1}{2}\)RSS) and using Note 4 that \(\text{sign}(\beta_j) = \text{sign}(\hat{\beta}_{OLS,j})\), we get

\[ \hat{\beta}_{LASSO,j} = \hat{\beta}_{OLS,j} - \lambda \, \text{sign}(\hat{\beta}_{OLS,j}) \]

This candidate is only valid when it actually has the assumed sign, i.e., when \(|\hat{\beta}_{OLS,j}| > \lambda\). When \(|\hat{\beta}_{OLS,j}| \le \lambda\), no nonzero \(\beta_j\) satisfies the first-order condition with a consistent sign, so the minimum is at \(\beta_j = 0\). Combining the two cases:

\[ \hat{\beta}_{LASSO,j} = \text{sign}(\hat{\beta}_{OLS,j}) \times \max(|\hat{\beta}_{OLS,j}| - \lambda, 0) \]

Ridge shrinks coefficients toward zero because as \(\lambda\) increases, the denominator \(1 + \lambda\) grows and \(\hat{\boldsymbol{\beta}}_{Ridge}\) shrinks proportionally. However, unless \(\lambda\) goes to infinity, dividing by \(1 + \lambda\) can never make a nonzero coefficient exactly zero. LASSO instead shrinks every coefficient toward zero by the constant amount \(\lambda\), and the \(\max(\cdot, 0)\) term means that any coefficient with \(|\hat{\beta}_{OLS,j}| \le \lambda\) is set exactly to zero. This is why LASSO can produce sparse solutions while ridge cannot.
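As a quick sanity check on these identities, the following standalone numeric sketch (in Python rather than R, purely illustrative; the data are synthetic) builds an orthonormal feature matrix via a QR decomposition, then confirms the ridge closed form and the soft-thresholding rule:

```python
import numpy as np

rng = np.random.default_rng(0)

# Orthonormal feature matrix via QR decomposition, so X'X = I_5
A = rng.normal(size=(50, 5))
X, _ = np.linalg.qr(A)
beta_true = np.array([3.0, -2.0, 0.5, 0.0, 1.0])
y = X @ beta_true + 0.1 * rng.normal(size=50)

beta_ols = X.T @ y  # OLS solution when X'X = I

# Ridge: closed form (X'X + lam*I)^{-1} X'y should equal beta_ols / (1 + lam)
lam = 0.7
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ y)
assert np.allclose(beta_ridge, beta_ols / (1 + lam))

# LASSO: per-coordinate objective b^2 - 2*b*m + 2*lam*|b| is minimized by the
# soft threshold sign(m) * max(|m| - lam, 0); verify against a grid search
def soft_threshold(m, lam):
    return np.sign(m) * np.maximum(np.abs(m) - lam, 0.0)

grid = np.linspace(-5, 5, 200001)
for m in beta_ols:
    obj = grid**2 - 2 * grid * m + 2 * lam * np.abs(grid)
    b_grid = grid[np.argmin(obj)]
    assert abs(b_grid - soft_threshold(m, lam)) < 1e-3

print("ridge and soft-threshold identities verified")
```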

Problem 2: Applying Ridge and LASSO (35 pts)

Ridge and LASSO are great methods for parsing through data sets with lots of predictors to find:

  1. An interpretable set of important predictors - which predictors are signal and which ones are just noise
  2. The set of parameters that minimize expected prediction error (with all the caveats that we discussed in the previous lectures)

Where these methods really shine for purpose 1 (and purpose 2, by construction) is when the ratio of predictors to observations approaches 1. To see this and work through an example using pre-built software, let’s try to build a model that predicts IMDB ratings for episodes of the Office (the U.S. version). office_train.csv includes IMDB ratings (imdb_rating) for 115 episodes of the Office and a number of predictors for each episode:

  1. The season of the episode (1 - 9, which should be treated as an unordered categorical variable)
  2. The number of times main characters speak in the episode (andy through jan)
  3. The director of the episode (ken_kwapis through justin_spitzer) already “dummied out” so that a 1 means the person directed the episode and a 0 means they did not

Let’s use this data to build a predictive model for IMDB ratings and check our predictive accuracy on the heldout test set (office_test.csv).

For this problem, you can restrict your search to the set of standard linear models (e.g. no interactions, no basis expansions, etc.). If you would like to try to include more terms to improve the model, you are more than welcome to try!

An important step: Get rid of any features that are observation specific!

Part 1 (10 pts)

Start by limiting yourself to the standard OLS model.

Find the regression coefficients that minimize the training error under squared error loss and use this model to compute the LOOCV estimate of the expected prediction error using all features.

Create a plot that demonstrates the regression coefficients granted by OLS. Which predictors are important? Which ones are not? I recommend using a sideways bar plot - you can see an example construction here.

Select a subset of “important” predictors and find the set of coefficients that minimizes MSE under squared error loss. Compute the LOOCV estimate of the expected prediction error for your smaller model. How does this LOOCV estimate compare to the LOOCV for the full model?

Note: You don’t need to try to perform any methods of subset selection in this step. Just use heuristic methods to determine which coefficients appear to be important for the model!

Part 1 solution:

#delete variable episode_name
o_train <- o_train %>% select(-episode_name)

#split variable season into 9 variables
dummy_o_train <- caret::dummyVars("~.", data=o_train)
o_train <- data.frame(predict(dummy_o_train, newdata=o_train))

#full model
p2p1_m <- lm(imdb_rating ~ ., data=o_train)

#LOOCV
lm_pred_metrics <- function(fit_model) {
  LOOCV <- mean((fit_model$residuals / (1-hatvalues(fit_model)))^2)
  return (LOOCV)
}
lm_pred_metrics(p2p1_m) #LOOCV is 0.2343854
## [1] 0.2343854
#plot
cf <- coef(summary(p2p1_m, complete=TRUE))
cf <- data.frame(rownames(cf), cf)

ggplot(data=cf, aes(x=rownames.cf., y=Estimate)) +
  geom_bar(stat="identity", width=0.75) + coord_flip() +
  labs(x="Features", y="Coefficient Estimate") +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5), 
        axis.title.x = element_text(face="bold", colour="red", size = 12),
        axis.title.y = element_text(face="bold", colour="red", size = 12)) 

From the plot, we see that the predictors season3, season1, paul_lieberstein, mindy_kaling, justin_spitzer, greg_daniels, and b_j_novak appear important, while the rest do not.

#subset of important predictors
p2p1_imp_pred <- o_train %>% select(seasonSeason.3, seasonSeason.1, paul_lieberstein, mindy_kaling, justin_spitzer, greg_daniels, b_j_novak, imdb_rating)

#important predictors model
p2p1_imp_pred_m <- lm(imdb_rating ~ ., data=p2p1_imp_pred)

#LOOCV
lm_pred_metrics(p2p1_imp_pred_m) #LOOCV is 0.2543553
## [1] 0.2543553

The LOOCV estimate for the important-predictors model is slightly larger than that of the full model.
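The leverage shortcut inside lm_pred_metrics - LOOCV = mean((e_i / (1 - h_ii))^2) - avoids refitting the model n times. As a sanity check, here is a standalone numeric sketch (in Python, on hypothetical synthetic data, purely illustrative) confirming that the shortcut matches brute-force leave-one-out refitting:

```python
import numpy as np

rng = np.random.default_rng(1)

# Small synthetic regression (illustrative data, unrelated to the Office set)
n, p = 40, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(size=n)

# Fit once; get residuals and leverages h_ii from the hat matrix
beta = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta
H = X @ np.linalg.solve(X.T @ X, X.T)
h = np.diag(H)

# Shortcut LOOCV: mean((e_i / (1 - h_ii))^2)
loocv_shortcut = np.mean((resid / (1 - h)) ** 2)

# Brute force: refit n times, each time holding one observation out
errs = []
for i in range(n):
    mask = np.arange(n) != i
    b_i = np.linalg.lstsq(X[mask], y[mask], rcond=None)[0]
    errs.append((y[i] - X[i] @ b_i) ** 2)
loocv_brute = np.mean(errs)

assert np.isclose(loocv_shortcut, loocv_brute)
print("LOOCV shortcut matches brute force")
```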

Part 2 (10 pts)

Now, consider ridge regression. Using a pre-built implementation of ridge regression, train the model using a large number of possible values for \(\lambda\).

Using \(10\)-fold cross validation, find a reasonable value of \(\lambda\) that should minimize the expected prediction error. You can choose the actual minimum or a slightly less complex model. Defend this choice.

Create a plot that demonstrates the regression coefficients for the ridge regression with your optimal choice of \(\lambda\). Which predictors are important? Which ones are not? I recommend using a sideways bar plot - you can see an example construction here.

Part 2 solution:

p2p2_pred <- data.matrix(o_train[,-38])
p2p2_out <- data.matrix(o_train$imdb_rating)

#100 possible values of lambda
p2p2_ridgem <- glmnet(x=p2p2_pred, y=p2p2_out, family="gaussian", alpha=0, nlambda=100)
#print(p2p2_ridgem) #100-row lambda path suppressed for brevity, as in Part 3
p2p2_cv_ridgem <- cv.glmnet(x=p2p2_pred, y=p2p2_out, family="gaussian", alpha=0, nfolds=10, type.measure="mse")
plot(p2p2_cv_ridgem)

print(p2p2_cv_ridgem)
## 
## Call:  cv.glmnet(x = p2p2_pred, y = p2p2_out, type.measure = "mse",      nfolds = 10, family = "gaussian", alpha = 0) 
## 
## Measure: Mean-Squared Error 
## 
##     Lambda Index Measure      SE Nonzero
## min  0.463    66  0.2072 0.02815      37
## 1se  6.875    37  0.2345 0.03910      37
p2p2_optimal_ridgem <- glmnet::glmnet(x=p2p2_pred, y=p2p2_out, family="gaussian", alpha=0, lambda=p2p2_cv_ridgem$lambda.min)
#plot
cf <- data.frame(cf1 = rownames(p2p2_optimal_ridgem$beta),
                 cf2 = as.vector(p2p2_optimal_ridgem$beta))

ggplot(data=cf, aes(x=cf1, y=cf2)) +
  geom_bar(stat="identity", width=0.75) + coord_flip() +
  labs(x="Features", y="Coefficient Estimate") +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5), 
        axis.title.x = element_text(face="bold", colour="red", size = 12),
        axis.title.y = element_text(face="bold", colour="red", size = 12))

I choose the lambda value that minimizes the expected prediction error under 10-fold cross validation (lambda.min), because the CV plot shows the error climbing steeply as \(\lambda\) increases beyond that point, so the less complex lambda.1se model would give up noticeable predictive accuracy.

Based on the plot, the important predictors are season9, season8, season7, season6, season5, season4, season3, season1, randall_einhorn, paul_lieberstein, paul_feig, greg_daniels, and gene_stupnitsky; the rest do not appear important.

Part 3 (10 pts)

Finally, consider linear regression with the LASSO penalty. Using a pre-built implementation, train the model using a large number of possible values for \(\lambda\).

Using \(10\)-fold cross validation, find a reasonable value of \(\lambda\) that should minimize the expected prediction error. You can choose the actual minimum or a slightly less complex model (larger \(\lambda\) means more shrinkage and a less complex model). Defend this choice.

Create a plot that demonstrates the regression coefficients for the LASSO regression with your optimal choice of \(\lambda\). Which predictors are important? Which ones are not?

Part 3 solution:

p2p3_pred <- data.matrix(o_train[,-38])
p2p3_out <- data.matrix(o_train$imdb_rating)

#100 possible values of lambda
p2p3_lassom <- glmnet(x=p2p3_pred, y=p2p3_out, family="gaussian", alpha=1, nlambda=100)
#print(p2p3_lassom)

p2p3_cv_lassom <- cv.glmnet(x=p2p3_pred, y=p2p3_out, family="gaussian", alpha=1, nfolds=10, type.measure="mse")
plot(p2p3_cv_lassom)

print(p2p3_cv_lassom)
## 
## Call:  cv.glmnet(x = p2p3_pred, y = p2p3_out, type.measure = "mse",      nfolds = 10, family = "gaussian", alpha = 1) 
## 
## Measure: Mean-Squared Error 
## 
##      Lambda Index Measure      SE Nonzero
## min 0.03046    21  0.2108 0.03366      20
## 1se 0.11205     7  0.2430 0.04013       4
#lambda.min varies slightly across runs because the CV folds are random (we got 0.03343 on our initial run, 0.03046 above)
p2p3_optimal_lassom <- glmnet(x=p2p3_pred, y=p2p3_out, family="gaussian", alpha=1, lambda=p2p3_cv_lassom$lambda.min)
#plot
cf <- data.frame(cf1 = rownames(p2p3_optimal_lassom$beta),
                 cf2 = as.vector(p2p3_optimal_lassom$beta))

ggplot(data=cf, aes(x=cf1, y=cf2)) +
  geom_bar(stat="identity", width=0.75) + coord_flip() +
  labs(x="Features", y="Coefficient Estimate") +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5), 
        axis.title.x = element_text(face="bold", colour="red", size = 12),
        axis.title.y = element_text(face="bold", colour="red", size = 12))

I choose the lambda value that minimizes the expected prediction error under 10-fold cross validation (lambda.min), because the CV plot shows the error climbing steeply as \(\lambda\) increases beyond that point; the sparser lambda.1se model would give up noticeable predictive accuracy.

Based on the plot, the important predictors are season8, season6, season3, season1, paul_feig, and greg_daniels; the rest do not appear important.

Part 4 (5 pts)

Which of OLS with all predictors, OLS with your chosen subset of predictors, Ridge, or LASSO has the smallest cross validation estimate of expected prediction error? Do you have any intuition as to why this result occurs?

Using the optimal models from each step, compute an estimate of the expected prediction error using the heldout test data. Does the same relationship hold?

Compare your plots for the regression coefficients for the full OLS model, the ridge regression, and LASSO regression. How do LASSO and Ridge improve the expected prediction error for this problem?

Part 4 solution:

data.frame(
  epe_full_lm = lm_pred_metrics(p2p1_m),
  epe_subset_lm = lm_pred_metrics(p2p1_imp_pred_m),
  epe_optimal_ridge = min(p2p2_cv_ridgem$cvm),
  epe_optimal_lasso = min(p2p3_cv_lassom$cvm))

According to the comparison above, the optimal ridge and LASSO models achieve similar cross-validation estimates of expected prediction error, both lower than those of the two OLS models. Intuitively, with 37 predictors and only 115 observations, the unpenalized OLS fits carry high variance; shrinkage trades a small amount of bias for a large reduction in variance, lowering expected prediction error.

#delete variable episode_name
o_test <- o_test %>% select(-episode_name)

#split variable season into 9 variables
dummy_o_test <- caret::dummyVars("~.", data=o_test)
o_test <- data.frame(predict(dummy_o_test, newdata=o_test))

p2p4_pred_mat <- as.matrix(o_test[,-38])
p2p4_pred_df_full <- o_test[,-38]
p2p4_pred_df_imp <- o_test %>% select(seasonSeason.3, seasonSeason.1, paul_lieberstein, mindy_kaling, justin_spitzer, greg_daniels, b_j_novak, imdb_rating)

#predictions
o_test$full_lm <- predict(p2p1_m, newdata=p2p4_pred_df_full)
o_test$subset_lm <- predict(p2p1_imp_pred_m, newdata=p2p4_pred_df_imp)
o_test$optimal_ridge <- predict(p2p2_optimal_ridgem, newx=p2p4_pred_mat)
o_test$optimal_lasso <- predict(p2p3_optimal_lassom, newx=p2p4_pred_mat)

#mse comparison
data.frame(
  mse_full_lm = mse(o_test$imdb_rating, o_test$full_lm),
  mse_subset_lm = mse(o_test$imdb_rating, o_test$subset_lm),
  mse_optimal_ridge = mse(o_test$imdb_rating, o_test$optimal_ridge),
  mse_optimal_lasso = mse(o_test$imdb_rating, o_test$optimal_lasso))

Yes, the same result holds in the heldout test data, where the optimal ridge and lasso models perform better than full and subset OLS models.

Problem 3: A Nifty Identity for Cubic Regression Splines (15 pts)

Regression splines are a broadly applicable method for regression analysis because they can be represented and estimated as an augmented OLS problem.

A generic cubic regression spline with \(K\) knots can be represented as a linear model with \(K + 4\) basis expansion terms:

\[h_1(x) = 1 \text{ ; } h_{\text{2 to 4}}(x) = \{x,x^2,x^3\}\]

\[h_{\text{5 to K + 4}}(x) = \{(x - \xi_1)_+^3 , (x - \xi_2)_+^3,...,(x - \xi_K)_+^3\}\]

\[y_i = \alpha + \sum \limits_{k = 2}^{K + 4} \beta_k h_k(x_i) + \epsilon_i\]

Cubic regression splines fit a function to the data that is continuous with respect to \(x\) and continuous in its first two derivatives.

For an arbitrary collection of \(K \le N\) knots, show that the cubic regression spline provides a function that is continuous in the first and second derivative at the knots.

Notes:

  1. You can assume that the function is continuous and twice differentiable away from the knots. This is true and provable, but you can just take that for granted without further explanation.

  2. With an appropriate argument about the relationship between continuity in the 1st and 2nd derivatives, you don’t need to show continuity on the first derivatives.

  3. Remember that we can show continuity by proving that the left and right limits of a function are equivalent!

Problem 3 solution:

Writing out the model, the truncated-power terms are \((x - \xi_k)_+^3\), which equal \((x - \xi_k)^3\) for \(x > \xi_k\) and \(0\) otherwise:

\[ \begin{eqnarray} y_i &=& \alpha + \beta_2 x_i + \beta_3 x_i^2 + \beta_4 x_i^3 + \sum \limits_{k = 1}^{K} \beta_{k+4} (x_i - \xi_k)_+^3 + \epsilon_i \end{eqnarray} \]

Away from the knots the fitted function is a cubic polynomial, so it is continuous and differentiable there (Note 1). At a knot \(\xi_k\), only the term \((x - \xi_k)_+^3\) changes form, so it suffices to check that this term and its first two derivatives have equal left and right limits at \(\xi_k\) (Note 3):

\[ \begin{eqnarray} \lim_{x \to \xi_k^-} (x - \xi_k)_+^3 &=& 0 = \lim_{x \to \xi_k^+} (x - \xi_k)^3\\ \lim_{x \to \xi_k^-} \frac{d}{dx}(x - \xi_k)_+^3 &=& 0 = \lim_{x \to \xi_k^+} 3(x - \xi_k)^2\\ \lim_{x \to \xi_k^-} \frac{d^2}{dx^2}(x - \xi_k)_+^3 &=& 0 = \lim_{x \to \xi_k^+} 6(x - \xi_k) \end{eqnarray} \]

Since the left and right limits agree for the function value and for the first and second derivatives at every knot, the cubic regression spline is continuous with continuous first and second derivatives at the knots. (Only the third derivative jumps, by \(6\beta_{k+4}\), at \(\xi_k\).) Alternatively, per Note 2, since the second derivative exists everywhere, the first derivative is differentiable and hence continuous everywhere.
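A standalone numeric sketch of the limit argument (in Python, purely illustrative; the knot location is arbitrary): at a knot, the truncated-power term and its first two derivatives approach the same value (zero) from both sides, while the third derivative jumps:

```python
import numpy as np

# Truncated-power term (x - xi)_+^3 and its derivatives away from the knot
xi = 1.5
f  = lambda x: np.maximum(x - xi, 0.0) ** 3
d1 = lambda x: 3 * np.maximum(x - xi, 0.0) ** 2
d2 = lambda x: 6 * np.maximum(x - xi, 0.0)

# Left and right limits at the knot: value, f', f'' all agree (all equal 0),
# so adding this term keeps the spline C^2 at xi
eps = 1e-8
for g in (f, d1, d2):
    assert abs(g(xi - eps) - g(xi + eps)) < 1e-6

# The third derivative jumps from 0 to 6 across the knot, which is allowed
d3_left, d3_right = 0.0, 6.0
assert d3_right - d3_left == 6.0
print("value, first and second derivatives continuous at the knot")
```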

Problem 4: Nonlinear Regression Methods (35 pts)

Part 1 (10 pts)

Let’s start by looking at a single predictor - the student/faculty ratio S.F.Ratio. Plot log out of state tuition against the student/faculty ratio. Does this look linear?

#initial recoding: use c_train_q4 moving forward for consistency
c_train_q4 <- c_train
c_train_q4$Outstate <- log(c_train$Outstate)

approachPlot <- qplot(x=S.F.Ratio, y=Outstate, data=c_train_q4, geom=c("point","smooth"), ylab="log out of state tuition", xlab="student/faculty ratio")
approachPlot

# The relationship looks roughly linear, though it could also be a second- or third-degree polynomial. Comparing the smoothed plot above with the linear fit below, we can see that the relationship has a linear trend.
lmApproachPlot <- qplot(x=S.F.Ratio, y=Outstate, data=c_train_q4, geom=c("point","smooth"), method="lm", ylab="log out of state tuition", xlab="student/faculty ratio")
lmApproachPlot

Using a measure of expected prediction error appropriate for a standard linear model, find the order of global polynomial that minimizes EPE. Be sure to note your choice and why you made it. Using this value, train your model on the full training set and plot the prediction curve implied by the polynomial model on your graph.

# we reuse the EPE-metric function from problem set 2, which gives several non-simulation-based estimates of the expected prediction error (AIC, BIC, LOOCV) to make this decision
lm_pred_metrics <- function(fit_model) {
  
  #log-likelihood
  N <- length(fit_model$residuals)
  sig <- sigma(fit_model)
  res <- fit_model$residuals
  ll <- -N/2*log(2*pi) - N/2*log(sig^2) - 1/(2*sig^2)*sum(res^2) #natural log, not log10
  
  #AIC
  d <- length(fit_model$coefficients) + 1
  AIC <- -2*ll + 2*(d)
  
  #BIC
  BIC <- -2*ll + d*log(N) #natural log, not log10
  
  #LOOCV
  LOOCV <- (res[1] / (1-hatvalues(fit_model)[1]))^2
  if (N>=2){
    for (i in 2:N){
      LOOCV <- LOOCV + (res[i] / (1-hatvalues(fit_model)[i]))^2
    }
  }
  LOOCV_final <- LOOCV / N
  
  return(c("AIC"=AIC, "BIC"=BIC, "LOOCV"=LOOCV_final))
}

#fit global polynomials of degree 1 through 10 and collect the EPE metrics
p4p1_models <- lapply(1:10, function(d) lm(Outstate ~ poly(S.F.Ratio, d), data=c_train_q4))
p4p1_df <- data.frame(num = 1:10, t(sapply(p4p1_models, lm_pred_metrics)))
#plot LOOCV for various degree of polynomial
ggplot(data=p4p1_df) + 
  geom_point(aes(x=num, y=LOOCV.1)) + 
  geom_line(aes(x=num, y=LOOCV.1)) +
  xlab("degree polynomial") + ylab('LOOCV') +
  theme_bw()

# We use LOOCV as our measure of EPE because it makes fewer distributional assumptions than AIC/BIC and is approximately unbiased for the expected prediction error. From the graph above, LOOCV reaches its minimum at degree 3, so we choose a third-degree global polynomial.
#optimal polynomial model 
#train the model with third degree of polynomial
optimalPolyModel <- lm(data=c_train_q4, Outstate ~ poly(S.F.Ratio,3))
polyPredResult <- data.frame(S.F.Ratio=c_train_q4$S.F.Ratio,
                         predOutState=optimalPolyModel$fitted.values,
                         originalOutState=c_train_q4$Outstate)

#plot our polynomial model
polyPredResultPlot <- ggplot(data=polyPredResult) +
  geom_point(aes(x=S.F.Ratio, y=predOutState, col='prediction')) +
  geom_point(aes(x=S.F.Ratio, y=originalOutState, col='original')) +
  scale_colour_manual(name="Legend", values=c('black','red')) +
  ylab('OutState') + theme_bw()
polyPredResultPlot

Next, estimate a cubic natural spline and a smoothing spline using the entire training data. For the natural spline, you need to choose the number of knots or degrees of freedom. I would recommend setting these to 5 to start and playing with it until you get something that looks right. For the smoothing spline, you should choose the final form using a built-in cross-validation method (most likely GCV). Add the prediction curve to your plot. How do the drawn curves differ between methods?

#natural spline(cubic)
#find the optimal df with our measure of EPE
p4_1_num <- seq(1,20,1)
p4_1_aic <- c()
p4_1_bic <- c()
p4_1_loocv <- c()

for (i in 1:20) {
  cubicSpline <- lm(Outstate ~ ns(S.F.Ratio,df=i), data=c_train_q4)
  tempCheck <- unname(lm_pred_metrics(cubicSpline))
  
  p4_1_aic <- append(p4_1_aic, tempCheck[1])
  p4_1_bic <- append(p4_1_bic, tempCheck[2])
  p4_1_loocv <- append(p4_1_loocv, tempCheck[3])
}
p4_1_dat <- data.frame(p4_1_num, p4_1_aic, p4_1_bic, p4_1_loocv)

#plot how each df perform with each measure of EPE
ggplot(data=p4_1_dat) +
  geom_point(aes(x=p4_1_num, y=p4_1_aic)) +
  geom_line(aes(x=p4_1_num, y=p4_1_aic)) +
  xlab('df') + ylab('aic') +
  theme_bw()

ggplot(data=p4_1_dat) +
  geom_point(aes(x=p4_1_num, y=p4_1_bic)) +
  geom_line(aes(x=p4_1_num, y=p4_1_bic)) +
  xlab('df') + ylab('bic') +
  theme_bw()

ggplot(data=p4_1_dat) +
  geom_point(aes(x=p4_1_num, y=p4_1_loocv)) +
  geom_line(aes(x=p4_1_num, y=p4_1_loocv)) +
  xlab('df') + ylab('loocv') +
  theme_bw()

#optimal natural spline model (df=4 chosen from the EPE plots above)
optimalCubicModel <- lm(Outstate ~ ns(S.F.Ratio,df=4), data=c_train_q4)
nsPredResult <- data.frame(S.F.Ratio=c_train_q4$S.F.Ratio,
                         predOutState=optimalCubicModel$fitted.values,
                         originalOutState=c_train_q4$Outstate)

predResultPlot <- ggplot(data=nsPredResult) +
  geom_point(aes(x=S.F.Ratio, y=predOutState, col='prediction')) +
  geom_point(aes(x=S.F.Ratio, y=originalOutState, col='original')) +
  scale_colour_manual(name="Legend", values=c('black','red')) +
  ylab('OutState') + theme_bw()
predResultPlot

#optimal smoothing spline model
optimalSSModel <- smooth.spline(x=c_train_q4$S.F.Ratio, y=c_train_q4$Outstate)
ssPredResult <- data.frame(S.F.Ratio=c_train_q4$S.F.Ratio,predOutState=predict(optimalSSModel, x=c_train_q4$S.F.Ratio), originalOutState=c_train_q4$Outstate)
predResultPlot <- ggplot(data=ssPredResult) +
  geom_point(aes(x=S.F.Ratio, y=predOutState.y, col='prediction')) +
  geom_point(aes(x=S.F.Ratio, y=originalOutState, col='original')) +
  scale_colour_manual(name="Legend", values=c('black','red')) +
  ylab('OutState') + theme_bw()
predResultPlot

#Comparing the optimal cubic natural spline and smoothing spline fits, the smoothing spline curve is noticeably smoother than the natural spline curve, especially on the left end where S.F.Ratio is below 15.

Finally, use your polynomial model and splines to create predictions for the test set and quantify the mean squared error for the test set. Which model performs best? Worst? Provide some rationale for this outcome.

#recode testing dataframe
c_test_q4 <- c_test
c_test_q4$Outstate <- log(c_test$Outstate)
#train three optimal models and calculate mse
test_poly_MSE = mse(predict(optimalPolyModel,newdata=c_test_q4), c_test_q4$Outstate)
test_ns_MSE = mse(predict(optimalCubicModel,newdata=c_test_q4), c_test_q4$Outstate)
test_ss_MSE = mse(predict(optimalSSModel, x=c_test_q4$S.F.Ratio)$y, c_test_q4$Outstate)

table<-data.frame(polyTestResult=test_poly_MSE, 
                  naturalSplineResult=test_ns_MSE, 
                  smoothingSplineResult=test_ss_MSE)
table
#From the table above, the natural cubic spline and smoothing spline models achieve similar test MSE, while the third-order polynomial does noticeably worse. The splines fit separate cubic pieces between knots (with linearity enforced beyond the boundary knots for the natural spline), so they can adapt locally to the data, whereas a single global third-order polynomial must trade off fit across the entire range of S.F.Ratio.

Part 2 (10 pts)

Now, let’s consider the multivariate case with all of the predictors. Let’s improve on the standard linear model by using LASSO to do some variable selection and shrinkage. Fit the LASSO model to the training data and use K-fold cross validation to select a value of \(\lambda\). Be sure to explain why you made the choice that you did. How many variables are used in the “optimal” model? Be sure to record the optimal \(\lambda\) for later use.
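`cv.glmnet()` performs the cross-validation internally; the procedure it runs can be sketched in base R. In this illustration-only sketch, `lm()` on simulated data stands in for the LASSO fit at a fixed \(\lambda\) (`cv.glmnet` repeats this over a whole grid of \(\lambda\) values and reports the mean CV error for each):

```r
set.seed(385)
n <- 100; K <- 10
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
dat <- data.frame(x = x, y = y)

# Randomly assign each observation to one of K folds
fold <- sample(rep(1:K, length.out = n))

fold_mse <- sapply(1:K, function(k) {
  fit  <- lm(y ~ x, data = dat[fold != k, ])        # train on K-1 folds
  pred <- predict(fit, newdata = dat[fold == k, ])  # predict the held-out fold
  mean((dat$y[fold == k] - pred)^2)
})

cv_error <- mean(fold_mse)  # analogue of the "Measure" column in cv.glmnet's output
```

The \(\lambda\) minimizing this averaged error is `lambda.min`; `lambda.1se` is the largest \(\lambda\) within one standard error of that minimum.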

# cast predictors (all columns except Outstate) and outcome to matrices for glmnet
lasso_predictor <- data.matrix(c_train_q4[,-9])
lasso_outcome <- data.matrix(c_train_q4$Outstate)

c_lasso_cv <- cv.glmnet(x=lasso_predictor, y=lasso_outcome, family="gaussian", alpha=1, nfolds=10, type.measure="mse")
print(c_lasso_cv)
## 
## Call:  cv.glmnet(x = lasso_predictor, y = lasso_outcome, type.measure = "mse",      nfolds = 10, family = "gaussian", alpha = 1) 
## 
## Measure: Mean-Squared Error 
## 
##      Lambda Index Measure       SE Nonzero
## min 0.00677    40 0.04602 0.004522      14
## 1se 0.03613    22 0.05003 0.004566       8
plot(c_lasso_cv)

optimalLambda<-c_lasso_cv$lambda.min
smallVarLambda<-c_lasso_cv$lambda.1se
c_lasso_optimal <- glmnet(x=lasso_predictor, y=lasso_outcome, family="gaussian", alpha=1, lambda=optimalLambda)

#We choose lambda.min because it minimizes cross-validated MSE; per the cv output above, the resulting optimal model retains 14 nonzero coefficients (shown below)
variableList<-c_lasso_optimal$beta
variableList
## 19 x 1 sparse Matrix of class "dgCMatrix"
##                         s0
## Private       2.955844e-01
## Apps          .           
## Accept        1.670720e-05
## Enroll        .           
## Top10perc     1.172253e-04
## Top25perc     8.045647e-05
## F.Undergrad  -4.439701e-06
## P.Undergrad   .           
## Room.Board    9.530433e-05
## Books        -1.729132e-06
## Personal      .           
## PhD           2.238420e-03
## Terminal      1.112712e-03
## S.F.Ratio    -1.297173e-02
## perc.alumni   3.792581e-03
## Expend        1.008414e-05
## Grad.Rate     3.423681e-03
## CollegeNames  .           
## AcceptRate    1.470308e-01

Part 3 (10 pts)

Now, let’s see if we can improve on the standard LASSO regression model using a GAM. There are three approaches you can take here:

  1. Use all of the predictors.
  2. Only use the predictors selected by LASSO under your optimal model.
  3. Try to fit the GAM with an even smaller model using a subset of predictors from a higher sparsity point on the LASSO path.

Since we can only estimate a GAM with terms and interactions that we know to put in, it can be really difficult to find the optimal model. This means that Option 2 or 3 is probably your best bet for this particular problem. The benefit of taking this subset approach is that we can really try to model meaningful non-linearities and interactions between features.

Use the LASSO regularization path for your LASSO model estimated in part 2 to pick the model with at most 6 predictors that minimizes expected prediction error. To select the included coefficients, I highly recommend that you leverage the plots produced by glmnet to find the corresponding \(\lambda\). Then, using the chosen value of \(\lambda\), find the set of non-zero coefficients. Regardless of your choice, your model should include S.F.Ratio (though I’m pretty sure that this variable will pop up pretty early along the LASSO regularization path!).
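Once a \(\lambda\) is chosen, finding the nonzero set reduces to filtering the coefficient vector. A base-R sketch using a hypothetical, made-up coefficient vector (in practice you would extract the real one, e.g. via `coef(c_lasso_cv, s = chosen_lambda)`, and filter its entries the same way):

```r
# Hypothetical coefficient vector at some lambda on the LASSO path;
# zeros mark predictors that have been dropped at this sparsity level.
beta <- c(Private = 0.24, Room.Board = 0, PhD = 0.0012,
          S.F.Ratio = -0.011, perc.alumni = 0.0032, Expend = 0)

selected <- names(beta)[beta != 0]
selected  # predictors surviving at this lambda
```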

#We decide to go with the second approach.
#From the cv.glmnet output and plot in part 2, we already have the MSE-minimizing lambda (optimalLambda) and the largest lambda whose error is within one standard error of the minimum (smallVarLambda):
optimalLambda<-c_lasso_cv$lambda.min
smallVarLambda<-c_lasso_cv$lambda.1se
# models trained with these two lambda values 
c_lasso_optimal <- glmnet(x=lasso_predictor, y=lasso_outcome, family="gaussian", alpha=1, lambda=optimalLambda)
c_lasso_largeLambda <- glmnet(x=lasso_predictor, y=lasso_outcome, family="gaussian", alpha=1, lambda=smallVarLambda)
# print the included predictors
c_lasso_optimal$beta
## 19 x 1 sparse Matrix of class "dgCMatrix"
##                         s0
## Private       2.955844e-01
## Apps          .           
## Accept        1.670720e-05
## Enroll        .           
## Top10perc     1.172253e-04
## Top25perc     8.045647e-05
## F.Undergrad  -4.439701e-06
## P.Undergrad   .           
## Room.Board    9.530433e-05
## Books        -1.729132e-06
## Personal      .           
## PhD           2.238420e-03
## Terminal      1.112712e-03
## S.F.Ratio    -1.297173e-02
## perc.alumni   3.792581e-03
## Expend        1.008414e-05
## Grad.Rate     3.423681e-03
## CollegeNames  .           
## AcceptRate    1.470308e-01
c_lasso_largeLambda$beta
## 19 x 1 sparse Matrix of class "dgCMatrix"
##                         s0
## Private       2.387063e-01
## Apps          .           
## Accept        .           
## Enroll        .           
## Top10perc     .           
## Top25perc     .           
## F.Undergrad   .           
## P.Undergrad   .           
## Room.Board    9.229394e-05
## Books         .           
## Personal      .           
## PhD           1.240153e-03
## Terminal      4.946984e-04
## S.F.Ratio    -1.062123e-02
## perc.alumni   3.163559e-03
## Expend        9.681418e-06
## Grad.Rate     3.169940e-03
## CollegeNames  .           
## AcceptRate    .
#In conclusion, from the nonzero coefficients of the 1se model we pick the following 5 predictors
#(dropping Room.Board, Terminal, and Expend, whose coefficients are smallest in magnitude):
  #Private / S.F.Ratio / PhD / perc.alumni / Grad.Rate
selectedVars <- c('Outstate','Private','S.F.Ratio','PhD','perc.alumni','Grad.Rate')
c_test_q4_selected <- c_test_q4[, selectedVars]
c_train_q4_selected <- c_train_q4[, selectedVars]

Using the selected variables, estimate a GAM model that best predicts log out of state tuition. Try different combinations of linear terms, spline terms, and thin-plate/tensor spline terms to try to minimize the GCV associated with your GAM. Most implementations will return this as part of the model object.

Note: There is no exhaustive way to find the actual GCV minimizer using GAMs. You’ll just need to use intuition to determine what variables have significant interactions that should be included in the model building step.
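For intuition about the score being minimized: for any linear smoother with smoother matrix \(S\), \(\text{GCV} = n \cdot \text{RSS} / (n - \text{tr}(S))^2\), where \(\text{tr}(S)\) plays the role of effective degrees of freedom. A base-R sketch computing this by hand for a plain `lm()` fit on simulated data (for an `lm`, \(\text{tr}(S)\) is simply the number of coefficients):

```r
set.seed(385)
n <- 50
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
fit <- lm(y ~ x)

rss <- sum(residuals(fit)^2)
edf <- sum(hatvalues(fit))   # trace of the hat matrix; equals 2 here (intercept + slope)

gcv <- n * rss / (n - edf)^2
```

`mgcv` computes the analogous quantity for a GAM, where the effective degrees of freedom also account for the smoothing penalty on each term.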

#try to build a model with GAM
# If we smooth every continuous predictor (Private enters as a linear term), the model and its GCV are:
fullGam<-gam(Outstate ~ Private+s(S.F.Ratio) + s(PhD) + s(perc.alumni) + 
               s(Grad.Rate),data=c_train_q4_selected,family = gaussian())
fullGamGcv<-fullGam$gcv.ubre.dev
print("The gcv for optimal model is")
## [1] "The gcv for optimal model is"
fullGamGcv
## [1] 0.05299458
# Next, search over pairs of continuous predictors for the best single te() tensor-spline interaction term
temMinGcv<-100
for (i in 3:6){
  for (j in 3:6){
    if (i==j){
      break
    }
    df<-data.frame(c_train_q4_selected[i],c_train_q4_selected[j],c_train_q4_selected[1])
    names(df)[1] <- "x1"
    names(df)[2] <- "x2"
    names(df)[3] <- "y"
    tempModel<-gam(Outstate ~ Private+s(S.F.Ratio) + s(PhD) + s(perc.alumni) + s(Grad.Rate)+te(df$x1,df$x2),
               data=c_train_q4_selected,
               family = gaussian())
    tempGcv<-tempModel$gcv.ubre.dev
    if (temMinGcv>tempGcv){
      temMinGcv<-tempGcv
      num1<-i
      num2<-j
      bestTEModel<-tempModel
    }
  }
}
# result
print("The gcv for optimal model is")
## [1] "The gcv for optimal model is"
temMinGcv # also the smallest gcv we got so far
## [1] 0.05067898
print("The interaction is between:")
## [1] "The interaction is between:"
names(c_train_q4_selected[num1])
## [1] "perc.alumni"
names(c_train_q4_selected[num2])
## [1] "S.F.Ratio"
# Repeat the search using a bivariate s() smooth for the interaction term instead
sMinGcv<-100
for (i in 3:6){
  for (j in 3:6){
    if (i==j){
      break
    }
    df<-data.frame(c_train_q4_selected[i],c_train_q4_selected[j],c_train_q4_selected[1])
    names(df)[1] <- "x1"
    names(df)[2] <- "x2"
    names(df)[3] <- "y"
    tempModel<-gam(Outstate ~ Private+s(S.F.Ratio) + s(PhD) + s(perc.alumni) + s(Grad.Rate)+s(df$x1,df$x2),
               data=c_train_q4_selected,
               family = gaussian())
    tempGcv<-tempModel$gcv.ubre.dev
    if (sMinGcv>tempGcv){
      sMinGcv<-tempGcv
      snum1<-i
      snum2<-j
      bestSModel<-tempModel
    }
  }
}
#plot(bestSModel,pages=1)

print("The gcv for optimal model is")
## [1] "The gcv for optimal model is"
sMinGcv 
## [1] 0.05100146
print("The interaction is between:")
## [1] "The interaction is between:"
names(c_train_q4_selected[snum1])
## [1] "Grad.Rate"
names(c_train_q4_selected[snum2])
## [1] "PhD"
# We now have three candidate models, with and without an interaction term, in various formulas.
# GCV for each model:
gcvSum<-data.frame(fullGamGCV=fullGamGcv,bestTEGCV=temMinGcv ,bestSGCV=sMinGcv)
gcvSum
#best model with one s() interaction term:
#gam(Outstate ~ Private + s(S.F.Ratio) + s(PhD) + s(perc.alumni) + s(Grad.Rate) + s(Grad.Rate, PhD))
plot(bestSModel,pages=1)

#fullGam model: simply s() on every continuous predictor
#gam(Outstate ~ Private + s(S.F.Ratio) + s(PhD) + s(perc.alumni) + s(Grad.Rate))
plot(fullGam,pages=1)

#best model with one te() interaction term:
#gam(Outstate ~ Private + s(S.F.Ratio) + s(PhD) + s(perc.alumni) + s(Grad.Rate) + te(perc.alumni, S.F.Ratio))
plot(bestTEModel,pages=1)

Create a plot that shows the function for each predictor. Do the marginal relationships make sense? For any 2-predictor spline terms, do they capture anything that wouldn’t be captured by a linear model?

#gam(Outstate ~ Private + s(S.F.Ratio) + s(PhD) + s(perc.alumni) + s(Grad.Rate) + s(Grad.Rate, PhD))
optGAMModel<-bestSModel 
plot(optGAMModel)

#We choose this model because it attains a comparatively low GCV, and an interaction between Grad.Rate and PhD makes (somewhat) more sense than the alternatives.
#The marginal relationships make sense for most of the predictors. For PhD and perc.alumni, it is plausible that the fitted functions change little as those variables change. For S.F.Ratio, it also makes sense that lower ratios matter more for out-of-state tuition (and vice versa for Grad.Rate).
#For the 2-predictor spline term, the last plot shows that our optimal model captures a joint relationship that would not be captured by a linear model.

Look at your function for S.F.Ratio. Does it look like the smoothing spline you uncovered in part 1? What does this say about the plausibility of the additivity assumption for this specific variable?

# s(df$x1, df$x2) in the summary below is s(Grad.Rate, PhD)
summary(optGAMModel)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## Outstate ~ Private + s(S.F.Ratio) + s(PhD) + s(perc.alumni) + 
##     s(Grad.Rate) + s(df$x1, df$x2)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.92354    0.02208  404.19   <2e-16 ***
## PrivateYes   0.34441    0.02785   12.37   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                   edf Ref.df      F p-value    
## s(S.F.Ratio)    3.346  4.244 15.335  <2e-16 ***
## s(PhD)          0.500  0.500  1.453  0.3944    
## s(perc.alumni)  1.000  1.000  5.072  0.0247 *  
## s(Grad.Rate)    5.332  6.285  1.352  0.1975    
## s(df$x1,df$x2) 11.000 15.024  5.738  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Rank: 65/67
## R-sq.(adj) =  0.697   Deviance explained = 70.8%
## GCV = 0.051001  Scale est. = 0.049031  n = 600
#The function for S.F.Ratio in the first plot above is quite similar in shape to the smoothing spline we uncovered in part 1. This suggests that the additivity assumption is plausible for S.F.Ratio: its effect enters the GAM largely on its own, and as the summary shows, its smooth term is highly significant.

Part 4 (5 pts)

Finally, let’s compare both models to see which one performs best on the out of sample data. Create predictions using your LASSO model and GAM for the out of sample test data. Use these predictions to compute the out of sample MSE for each method. Which performs best?

#GAM model: optGAMModel, refit with the interaction written out explicitly
GamPredmodel<-gam(Outstate ~ Private+s(S.F.Ratio) + s(PhD) + s(perc.alumni) + s(Grad.Rate)+s(Grad.Rate,PhD),data=c_train_q4_selected,family = gaussian())

mse_GAM<-mse(c_test_q4_selected$Outstate,predict.gam(GamPredmodel,c_test_q4))

#LASSO model: c_lasso_optimal
lasso_test_predictor <- data.matrix(c_test_q4[,-9])
mse_lasso<-mse(c_test_q4$Outstate,predict(c_lasso_optimal,newx=lasso_test_predictor))

p4p4_mseOutput<-data.frame(GAM_mse=mse_GAM,lasso_mse=mse_lasso)
p4p4_mseOutput
#Taking the MSE table above as reference, LASSO performs better than our GAM model.

In a few sentences, discuss the practical limitations of the LASSO and GAM approach used in part 3. When will this kind of process not be viable?

The LASSO-then-GAM approach used in part 3 lets us select variables partly by algorithm and partly by our own intuition. When interactions between variables are important, we may lose some of them by dropping variables during the LASSO step, since LASSO selects on marginal linear effects rather than interactions. Moreover, with a GAM the final "optimal" model depends heavily on human judgment about which smooths and interactions to try: this freedom in model selection can introduce substantial bias when the analyst is biased or lacks knowledge of the data's background. The process also becomes impractical as the number of candidate predictors grows, because the space of possible smooth and interaction terms explodes combinatorially.